Tuesday, August 28th, 2018

Introduction

  1. The Data Science Process
  2. The Case
  3. The Tidyverse

My Third Favorite Picture:

  • Exploratory data analysis (EDA) covers roughly the "transform" and "visualize" steps.

EDA is…

…getting to know your data

What We'll Do This Morning

  • Ask an impactful question of a real data set.
  • Use EDA to propose candidate answers.
  • Create a simple business intelligence (BI) dashboard to guide decision-makers.

The Case

  • AirBnB can maximize its business by ensuring a strong supply of diverse hosts in various neighborhoods.
  • Their question for you: Where should we recruit more hosts?
  • They have given you a complex, multi-part data set to study using R!

FYI: Base R and the Tidyverse

  • You can do EDA in "base" R without any packages.
  • But base R is a bad programming language.
  • We will use the Tidyverse, a set of packages that promote code which is easy to write and read, highly performant, and consistent across the data science pipeline.
  • The Tidyverse has been extensively developed by Hadley Wickham and collaborators over the last decade.

If you have prior experience in R and did not begin all your scripts with library(dplyr)….

Getting Started

  1. Data Import and Inspection
  2. Data Subsetting
  3. The Pipe

Case Study, Part 1

  1. Load libraries
  2. Import and inspect the data
  3. The Nicest Places in JP
  4. The Biggest Places in Back Bay

Pipes for your Data

  • x %>% f() \(\Longleftrightarrow\) f(x)
  • "Take x, and then do f to it"
  • x %>% f(y) \(\Longleftrightarrow\) f(x,y)
  • x %>% f(y) %>% g(z) \(\Longleftrightarrow\) g(f(x,y),z)
  • "Take x, then do f with option y, then do g with option z…"

Some Simple Examples

# familiar
listings %>% glimpse()  # = glimpse(listings)
listings %>% head()     # = head(listings)
listings %>% colnames() # = colnames(listings)

# get all columns with "review_scores" in the column name
listings %>% select(contains('review_scores')) 
# what should this return? 
listings %>% select(contains('review_scores')) %>% colnames()
# compare: colnames(select(listings, contains('review_scores')))
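To check the equivalence without the full data set, here is a minimal sketch on a toy table. The small `listings` tibble below is made up for illustration and only borrows column names from the case study; it is not the real data.

```r
library(dplyr)

# Toy stand-in for the listings table (hypothetical values)
listings <- tibble(
  id                   = 1:3,
  review_scores_rating = c(90, 95, 80),
  review_scores_value  = c(9, 10, 8),
  price                = c(100, 150, 80)
)

# Piped version: read left to right, "take listings, then select, then colnames"
piped <- listings %>%
  select(contains("review_scores")) %>%
  colnames()

# Nested version: same computation, read inside out
nested <- colnames(select(listings, contains("review_scores")))

print(piped)
print(identical(piped, nested))
```

Both forms return the same character vector of column names; the pipe only changes the order in which you read the steps.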

Let's try this out – back to the case study!

Summarising Data

  1. Summary Statistics
  2. Adding Columns
  3. Grouping

Data Summaries

  • You should usually summarise your data before turning on the fancy algorithms – sometimes the story is clear.

Summaries the Tidy Way

  • data %>% mutate(new_col = formula(old_col1, old_col2)) creates new columns.
  • data %>% group_by(col) groups data for breakout summaries.
  • data %>% summarise(measure = formula(col1, col2)) computes summaries.
  • data %>% group_by(col) %>% summarise(measure = formula(col1, col2)) computes breakout summaries.
  • Let's test these out in the case study.
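The patterns above can be sketched on a toy table before trying them on the real data. The values, the `accommodates` column, and the `price_per_person` name are assumptions made up for this illustration.

```r
library(dplyr)

# Toy stand-in for the listings table (hypothetical values)
listings <- tibble(
  neighbourhood = c("Jamaica Plain", "Jamaica Plain", "Back Bay", "Back Bay"),
  price         = c(100, 140, 250, 300),
  accommodates  = c(2, 4, 2, 5)
)

# mutate(): add a new column computed from existing ones
with_ppp <- listings %>%
  mutate(price_per_person = price / accommodates)

# summarise(): collapse the whole table to one row
overall <- listings %>%
  summarise(mean_price = mean(price))

# group_by() + summarise(): one summary row per group
by_hood <- listings %>%
  group_by(neighbourhood) %>%
  summarise(mean_price = mean(price), n = n())

print(with_ppp)
print(overall)
print(by_hood)
```

Note that `mutate()` keeps every row, while `summarise()` returns one row per group (or a single row when the data is ungrouped).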

Keeping Current

  1. More practice with filter and summarise
  2. Joining data

How Recent is our Info?

calendar %>% 
    summarise(earliest = min(date), 
              latest = max(date))
## # A tibble: 1 x 2
##   earliest   latest    
##   <date>     <date>    
## 1 2016-09-06 2017-09-05

But some of these listings may be "zombies" without recent availability. How can we include only listings with availability from a certain time period?

The Approach

  1. Operate on the calendar table (exercise)
  2. Join that information to the listings table (together)
  3. Filter the listings table accordingly (together)
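One way these three steps could fit together is sketched below on toy tables. This is not the exercise solution: the `listing_id` key name, the `available` column coded as "t"/"f", the cutoff date, and the `nights_available` name are all assumptions chosen for illustration.

```r
library(dplyr)

# Toy stand-ins for the two tables (hypothetical values)
listings <- tibble(id = c(1, 2, 3), name = c("A", "B", "C"))
calendar <- tibble(
  listing_id = c(1, 1, 2, 3),
  date       = as.Date(c("2016-09-06", "2017-08-01", "2016-10-01", "2017-09-05")),
  available  = c("t", "t", "f", "t")
)

# Step 1: operate on calendar, counting available nights after a cutoff date
recent_availability <- calendar %>%
  filter(available == "t", date >= as.Date("2017-06-01")) %>%
  group_by(listing_id) %>%
  summarise(nights_available = n())

# Step 2: join that information onto listings, matching id to listing_id
listings_joined <- listings %>%
  left_join(recent_availability, by = c("id" = "listing_id"))

# Step 3: keep only listings that had any recent availability
active_listings <- listings_joined %>%
  filter(!is.na(nights_available))

print(active_listings)
```

Listings with no matching calendar rows get `NA` after the join, which is what the final filter uses to drop the "zombies".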

Relational Data

The information we need is distributed between two tables – how can we get there?

We need a key column that tells us which calendar rows correspond to which listings.

listings$id corresponds to calendar$listing_id

join

The join family of functions lets us add columns from one table to another using a key.

  • x %>% left_join(y) : most common, keeps all rows of x but not necessarily y.
  • x %>% right_join(y) : keeps all rows of y but not necessarily x.
  • x %>% full_join(y) : keeps all rows of both x and y.
  • x %>% inner_join(y) : keeps only rows of x that match in y and vice versa.

We'll use left_join for this case. On to the exercise!
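To see how dplyr's join verbs differ in which rows they keep, here is a small sketch on two toy tables. The tables `x` and `y` and their columns are made up for illustration.

```r
library(dplyr)

# Two toy tables sharing the key column id (hypothetical values)
x <- tibble(id = c(1, 2, 3), host   = c("Ana", "Ben", "Cy"))
y <- tibble(id = c(2, 3, 4), nights = c(10, 0, 5))

left  <- x %>% left_join(y,  by = "id")  # all rows of x: ids 1, 2, 3
right <- x %>% right_join(y, by = "id")  # all rows of y: ids 2, 3, 4
full  <- x %>% full_join(y,  by = "id")  # all rows of both: ids 1, 2, 3, 4
inner <- x %>% inner_join(y, by = "id")  # only matching rows: ids 2, 3

print(left)
print(full)
print(inner)
```

Rows without a match are filled with `NA` in the columns that come from the other table. When the key has different names in the two tables, `by = c("id" = "listing_id")` maps one to the other.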

Getting Visual

  1. Graphical Excellence
  2. The Grammar of Graphics
  3. ggplot2

Graphical Excellence

Graphical excellence is the well-designed presentation of interesting data – a matter of substance, of statistics, and of design. Graphical excellence consists of complex ideas communicated with clarity, precision, and efficiency.

Edward Tufte

The Grammar of Graphics

A grammar is a set of components (ingredients) that you can combine to create new things. Many grammars have required components: if you're missing one, you're doing it wrong. In baking….

  • A body – typically some kind of flour
  • Binder – eggs, oil, butter, applesauce
  • A rising agent – yeast, baking soda, baking powder
  • Flavoring – sugar, salt, chocolate, vanilla

The Grammar of Graphics

  • Puts the gg in ggplot2.
  • Formulated by Leland Wilkinson.
  • Implemented in code by Hadley Wickham, now part of the tidyverse

Ingredients of a data visualization

  • Data: almost always a data frame
  • Aesthetic mapping: relation of data to chart components.
  • Geometry: specific visualization type? E.g. line, bar, heatmap?
  • Statistical transformation: how should the data be transformed or aggregated before visualizing?
  • Theme: how should the non-data parts of the plot look?
  • Misc. other options.
  • (+ plays the same role in ggplot2 that %>% does in data manipulation.)
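As a rough sketch, the ingredients above map onto ggplot2 layers as shown in the comments below. The `toy` data frame is made up for illustration; the column names are borrowed from the case study.

```r
library(ggplot2)

# Toy data (hypothetical values)
toy <- data.frame(
  number_of_reviews    = c(3, 10, 25, 40, 80),
  review_scores_rating = c(80, 92, 95, 90, 97)
)

p <- ggplot(data = toy) +                                  # data
  aes(x = number_of_reviews, y = review_scores_rating) +   # aesthetic mapping
  geom_point(stat = "identity", alpha = 0.5) +             # geometry, with its statistical transformation
  theme_bw() +                                             # theme
  labs(title = "Toy example")                              # misc. other options

print(p)
```

Here `stat = "identity"` is already the default for `geom_point()` and is spelled out only to show where the statistical transformation lives; other geoms, such as `geom_bar()`, aggregate the data by default.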

First Plot

Does getting lots of reviews usually mean you get good reviews?

listings %>% 
    ggplot()

First Plot

listings %>% 
    ggplot() + 
    aes(x = number_of_reviews, y = review_scores_rating)

First Plot

listings %>% 
    ggplot() + 
    aes(x = number_of_reviews, y = review_scores_rating) + 
    geom_point()

First Plot

listings %>% 
    ggplot() + 
    aes(x = number_of_reviews, y = review_scores_rating) + 
    geom_point(alpha = .2) 

First Plot

listings %>% 
    ggplot() + 
    aes(x = number_of_reviews, y = review_scores_rating) + 
    geom_point(alpha = .2) + 
    theme_bw()

First Plot

listings %>% 
    filter(number_of_reviews < 100) %>%
    ggplot() + 
    aes(x = number_of_reviews, y = review_scores_rating) + 
    geom_point(alpha = .2) + 
    theme_bw() 

First Plot

listings %>% 
    filter(number_of_reviews < 100) %>%
    ggplot() + 
    aes(x = number_of_reviews, y = review_scores_rating) + 
    geom_point(alpha = .2) + 
    theme_bw() + 
    labs(x='Number of Reviews', y='Review Score', title='Review Volume and Review Quality') 

First Plot

listings %>% 
    filter(number_of_reviews < 100) %>%
    ggplot() + 
    aes(x = number_of_reviews, y = review_scores_rating) + 
    geom_point(alpha = .2, color = 'firebrick') + 
    theme_bw() + 
    labs(x='Number of Reviews', y='Review Score', title='Review Volume and Review Quality') 

Changing Aesthetics

listings %>% 
    filter(number_of_reviews < 100) %>%
    ggplot() + 
    aes(x = review_scores_value, 
        y = review_scores_location, 
        size = number_of_reviews) + 
    geom_point(alpha = .2, color = 'firebrick') + 
    theme_bw() 

As a Heatmap

listings %>% 
    filter(number_of_reviews < 100) %>%
    ggplot() + 
    aes(x = review_scores_value, 
        y = review_scores_location, 
        fill = number_of_reviews) + 
    geom_tile() + 
    theme_bw() 

Exercise 6

The following code computes the average price of all listings on each day in the data set:

average_price_table <- calendar %>% 
    mutate(price = price %>% gsub('\\$|,', '',.) %>% as.numeric()) %>% 
    group_by(date) %>% 
    summarise(mean_price = mean(price, na.rm = TRUE))

Use geom_line() to visualize these prices with time on the x-axis and price on the y-axis.
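The price-cleaning step in the code above can be checked on a few strings first. This is a small sketch with made-up values, assuming prices are stored as text such as "$1,250.00" and that some entries are missing.

```r
library(dplyr)

# Hypothetical raw price strings, including a missing value
raw_prices <- c("$85.00", "$1,250.00", NA)

# Remove dollar signs and thousands separators, then convert to numbers.
# The dot marks where the piped value goes when it is not the first argument.
clean_prices <- raw_prices %>%
  gsub('\\$|,', '', .) %>%
  as.numeric()

print(clean_prices)

# na.rm = TRUE drops missing prices before averaging
print(mean(clean_prices, na.rm = TRUE))
```

Without `na.rm = TRUE`, a single missing price would make the mean for that day `NA`.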

Exercise 6 Sample Solution

average_price_table %>% 
    ggplot() + 
    aes(x = date, y = mean_price) + 
    geom_line()

Exercise 7

Using the summary_table object you created earlier, make a bar chart showing the number of apartments by neighbourhood. In this case, the correct geom to use is geom_bar(stat = 'identity').

Exercise 7 Sample Solution

summary_table %>% 
    filter(property_type == 'Apartment') %>% 
    ggplot() + 
    aes(x = neighbourhood, y = n) + 
    geom_bar(stat = 'identity')

Let's Clean This Up a Bit

summary_table %>% 
    filter(property_type == 'Apartment') %>% 
    ggplot() + 
    aes(x = reorder(neighbourhood, n), y=n) + 
    coord_flip() + 
    geom_bar(stat = 'identity')
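To see what `reorder()` does to the axis, here is a minimal sketch on a toy summary table. The `summary_toy` data frame and its counts are made up for illustration, and `geom_col()` is used as shorthand for `geom_bar(stat = 'identity')`.

```r
library(ggplot2)

# Toy stand-in for summary_table (hypothetical counts)
summary_toy <- data.frame(
  neighbourhood = c("Allston", "Back Bay", "Jamaica Plain"),
  n             = c(30, 10, 50)
)

# reorder() turns neighbourhood into a factor whose levels are sorted by n
ordered_levels <- levels(reorder(summary_toy$neighbourhood, summary_toy$n))
print(ordered_levels)

# The sorted levels set the bar order; coord_flip() makes long labels readable
p <- ggplot(summary_toy) +
  aes(x = reorder(neighbourhood, n), y = n) +
  geom_col() +
  coord_flip() +
  labs(x = "Neighbourhood", y = "Number of listings")

print(p)
```

Without `reorder()`, the bars would follow alphabetical order, which makes it harder to compare neighbourhoods at a glance.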

Comparisons: Fill, Color, and Facets

From Exercise 7

summary_table %>% 
    ggplot() + 
    aes(x = reorder(neighbourhood, n), y=n, fill = property_type) + 
    coord_flip() + 
    geom_bar(stat = 'identity') 

From Our First Plot

listings %>% 
    filter(number_of_reviews < 100) %>%
    ggplot() + 
    aes(x = number_of_reviews, y = review_scores_rating, color = property_type) + 
    geom_point(alpha = .5) + 
    theme_bw() + 
    labs(x='Number of Reviews', y='Review Score', title='Review Volume and Review Quality') 

From Our First Plot

listings %>% 
    filter(number_of_reviews < 100) %>%
    ggplot() + 
    aes(x = number_of_reviews, y = review_scores_rating, color = property_type) + 
    geom_point(alpha = .5) + 
    theme_bw() + 
    facet_wrap(~property_type) + 
    labs(x='Number of Reviews', y='Review Score', title='Review Volume and Review Quality') 

Optional: Score Types

listings %>% 
    select(number_of_reviews, contains("review_scores"), - review_scores_rating) %>% 
    gather(key = type, value = score, -number_of_reviews) %>% 
    ggplot() + 
    aes(x = factor(score), y = number_of_reviews) + 
    geom_boxplot() + 
    facet_wrap(~type)
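The `gather()` step above reshapes the score columns from wide to long so that each score type can get its own facet. A minimal sketch on a toy table follows; the values are made up, and the `pivot_longer()` call is shown only as the newer tidyr equivalent, not as part of the original code.

```r
library(dplyr)
library(tidyr)

# Toy wide table: one column per score type (hypothetical values)
wide <- tibble(
  number_of_reviews      = c(5, 20),
  review_scores_value    = c(9, 10),
  review_scores_location = c(8, 10)
)

# gather(): column names go into `type`, their values into `score`
long <- wide %>%
  gather(key = type, value = score, -number_of_reviews)

# Same reshape with pivot_longer(), which supersedes gather() in newer tidyr
long_new <- wide %>%
  pivot_longer(cols = -number_of_reviews, names_to = "type", values_to = "score")

print(long)
print(long_new)
```

Both results have one row per listing and score type; only the row order differs.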

Mini-Project

Project Description

  • Working with your partner, you are going to make a business intelligence (BI) dashboard for AirBnB, using the EDA skills we have developed in this session.
  • You will use this dashboard to lead a meeting with decision-makers on where to prioritize host recruitment efforts.
  • It will look like this – but better!

Instructions

  1. Open wrangle_viz/dashboard.Rmd
  2. Click the knit button at the top of RStudio and observe the result. If you see a dashboard, then you are good to go.
  3. Modify the dashboard:
    • Include your names in the author metadata up top
    • Write code for data preparation and visualizations.
    • Include all your code in the R "code chunks" that begin with ```{r}
    • Add commentary in the indicated area
    • Knit your dashboard again. Save the .Rmd file. We're coming back to it this afternoon!
  4. FILL OUT FEEDBACK SURVEY
  5. Resources: Data wrangling cheatsheet, R graphics cheatsheet, R Graphics Cookbook

Additional Resources

Map of the Tidyverse

Guides and Cheatsheets

Books and Courses

Other Topics in R